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Summary 
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Many automatic document retrieval systems represent documents and 

t * 

requests for documents as numeric vectors that indicate the subjects 

o 

treated by the document or query. This study investigates relevance feed- 



back, a process that allows user interaction with such a retrieval system. 

» 

i „ 4r ' ff/ 

The user is presented with a small set of possibly relevant documents, and 
is asked to judge each as relevant or nonrelevant to his request. The nu- 
meric vectors representing the judged documents are used to modify the nu- 



meric vector representing the query, and the new query vector *Ls used to 



retrieve a more appropriate set of documents. The relevance feedback pro- 



cess can be iterated as often as desired. Several feedback algorithms are 



investigated in a collection of 200 documents and 42 queries. Two distinct 

* * 

> * 

viewpoints are taken in evaluation; one measures the movement of the query 



vector toward the optimum query defined by Rocchio, the other measures the 

/ 

retrieval experienced by the user during the feedback process. Several 



performance measures are reported from each evaluation viewpoint. Both 

< 

\ 

evaluation methods indicate that relevance feedback is an effective process. 
All ajjgorithms tested that use only relevant document vectors for 



feedback provide equally good retrieval. Such algorithms should supply 



X 



additional documents to any user who judges every document presented for 
feedback to be nonrelevant., Algorithms using nonrelevant document vectors 
for feedback improve the retrieval obtained by these users without requiring 
additional relevance judgments. 
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The relevance feedback algorithms tested are based orKthe assumption 



that the vectors representing the documents relevant to a query are clustered 



/ 



in the same area of the document space. The conclusion that no tested rele- 
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vance feedback algorithm is completely appropriate for the experimental 

' ' ’ * . • . ,, . 

* „ \v < ■ 

environment is supported by a hypothesis that explains the observed con- 
trasts -between the behavior of strategies using only relevant documents for 

v " . • 'v/> 

feedback and that pf strategies using nonrelevant documents. This hypothesis 

\ % * 

‘ 

states that for mo^t- queries, some relevant document vectors are separated 

0 

from others' by one or more nonrelevant document vectors. The implications 

- . i / - . 

of this result for future research in relevance feedback, partial search or 

’ *■ \ 

multi-level' strategics, multiple query strategies, request clustering, and 

« • i r ■ ■ ■ 

document vector modification are discussed, and useful evaluation measures 

* # 
and new algorithms t for these areas are suggested. 
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Synopsis 



The present proliferation of technical and scientific books 
and articles is too much of a good thing. The so-called "information 
explosion" has created a need for new methods of document classifica- 
tion and Retrieval with the following characteristics: 



t s 1 ' * 

1) The number of expert indexers and librarians obtainable 
is completely inadequate for processing such a va'tame" of " 
information, even in a cursory manner. An automatic .com- 
puterized system is needed. [1] 

2) Even with high speed computers, the retrieval operation 




i 



3) The user should not be required to understand the detailed 
operation of the system. For this reason, the system 
.should ideally respond to information requests in natural 
language . 



This study deals with an experimental automatic document re- 
trieval system that meets these requirements. Within this experi- 
mental system, several retrieval algorithms that permit user inter- 
action with the search process are evaluated. All of these techniques 

j 

employ user judgment of the relevance of certain selected documents to 
lj^is request, and are called "relevance feedback" algorithms. 

Section I of this report describes several methods of auto- - 

( 

matic document retrieval, and details the experimental retrieval sys- 

v w . 

tern used in this study (the SMART system [31). Section II examines 

t 

several means of user interaction with an automatic retrieval system, 
and summarizes the results of some prior experiments with user inter- 
action in the SMART system). In Section III the results of earlier 

xii 



! 



\ 



experiments with several relevance feedback algorithms are presented. 
Section IV details the features of the experimental environment of this 

'study. In Section V, means of evaluating the performance of an inf or- 

♦ 

mation retrieval system are discussed, anJ the evaluation measures and 
statistical tests used in this study are described. Section VI. contains 



VI-A relevance feedback results in .two document collections are compared. 
Section VI-B compares feedback algorithms that use only information from 



queries. In Section ^.VII/ the last section of this study, the results 



the results of relevance feedback experiments in five areas. In Section 



f 

the documents judged ’/relevant^ by the user, and Section VI-C examines 

v 

the effect on these algorithms of^the number of documents used for feed 




back. 



that use information from 



relevant and nonrelevant\ document^ and Section VI-E further studies 

■\ \ 



, \ 

the comparative usefulness of these strategies for different types of 



«*v-» y 

of these relevance feedback experiments are used to support recommen 
dations for interactive document retrieval systems, and to suggest 
guidelines for future experiments with regard to relevance feedback 



algorithms, evaluation of feedback performance, partial search stra- 



tegies, multiple query strategies, request clustering, and permanent 



document vector modification. 



• xiii 



Relevance Feedback 

In An Automatic Document Retrieval System 



Mi 






aMM 



1-1 



Chapter I 

Automatic Document Retrieval Systems 
The conventional library classifies documents by numeric, subject 

I 

* N 

codes which are assigned manually (Dewey decimal system, Library of Congress 

system). Cross-indexing is^provided in a card file by* "subject" title, 

. . J S' * ■ 

and author. Both the numeric index and the "subject" cross-indexing may 

be inadequate for retrieval. - A book on the intersection of two subjects 



(e.g., "The Aerodynamics of Birds") or on a new subject (e.g. , automata 

\ * 

theory - — is it mathematics, computer science, logic?) is hard to classify 
and therefore hard to find, unless the librarian is asked where he filed 
it. A fully automatic system must duplicate the function of this librarian 

by extracting information from a natural language request and retrieving 

* * > 

precisely those documents most likely to be needed by the requestor. 

One method of subject classification that is used ‘in automatic 
retrieval systems assigns to each document a list of subject identifiers, 
often called "keywords". This list can be treated as a binary vector by 
associating a position-, in the vector with each possible keyword in the 
retrieval system. The vaiub in a vector position is one if the \ associated 
keyword is assigned to the document described by the vector, zero other- 
wise. Retrieval systems operated by NASA and by the National Library of 
Medicine (Medlars) use this type of subject classification [2]. In both 
of these automatic retrieval systems, keywords are assigned to documents 
manually by subject experts. 

An extension of keyword indexing represents each document as a^/ 
pbsitive weighted concept vector rather than a binary vector. Each cl^ssi- 
fication concept is weighted to indicate its importance in the document. 



I 



In the SMART retrieval system, these concepts and weights are assigned by 



automatic processing of. the natural language text of each document or 



/ 



abstract [3], 



/ 



The user's query in an automatic informatipn retrieval system can 

* 

take several forms. The user may be asked to formulate his query using 
a restricted language. This language usually includes the set of keywords 

A 

defined for the collection and sometimes the Boolean operations "and", 



"or", and "not". In the NASA system the user can assign values to each 
keyword instead of using the logical operations. Cross-referencing and 
hierarchal relationships among keywords can be used in both the NASA and 
Medlars systems to refine or expand the user's initial query. Both systems 
require the user to understand the indexing system in order to formulate 
effective search requests. 

In the SMART retrieval system, the user is asked to phrase his 
query in natural language. The query is then processed in the same way as 
the document abstract, and a query concept vector is created. Logical 
relationships are not used in the query analysis. 



SMART provides a iiully automatic information storage and retrieval 
system of relatively simple form. Document abstracts are analyzed to 
construct representative concept vectors which are stored in the computer. 

When a user types a natural-language request into the system, it is con- 

<■* ' * 

verted to a concept vector representation in the same manner. Several 

types of automatic text-to-vector conversions have been used with the 
• * # 

SMART system [3]. A list of common words to be ignored in constructing 



r 







•the concept vector is provided. A suffix dictionary is used to reduce all 
words to word stem form. A word stem thesaurus which treats each distinct 
word stem as a concept in the concept vector is used as an experimental 



standard. Frequency characteristics may be used to eliminate some concepts 
("partial- stem thesaurus"), for example, words occurring* less than five or 
more than 100 times in a given collection may be eliminated. This study 




uses a - thesaurus, which was constructed semi -automatically for the subject 

t 

area of > aeronautical engineering. This "regular thesaurus" recognizes 

\ , 

synonyms; that is, it converts words of the same meaning to the seme con- 
cept, providing better retrieval performance than the stem thesaurus [4]. 

/ The degree of relationship between a query and a document is deter- 



mined in the SMART system by some "distance function" of the query and 
document concept vectors. The most effective of the distance functions 
tested in SMART appears to be the cosine correlation, which measures the 
angle between concept vectors in n-dimensional space [4,5] . The cosine 
coefficient of two concept vectors ranges from 0 to 1, and is found by the 
formula 

n 

E r . s . 

/ \ i 1 1 

cos (r,s) = 1 




The process of determining the relationship of each" document in the 
collection (or some subset thereof) to the user's query is called a "search 
operation. The distance function is used to assign to each document a 



I 
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correlation coefficient indicating the relationship between the concept 
vector for that document and the query vector. The document identification 
numbers are then ordered by correlation coefficient and are assigned ranks 
from one to N (number of documents being searched) for evaluation purposes. 
The document cost closely related to the user’s query is assigned the ^rank 
1 (considered the "highest" rank). 

^ The retrieval algorithm is the goal of any information retrieval 



system. For each query, the system must produce a set of documents relevant 
to the requestor’s need. In the SMART system the retrieval algorithm can 
be varied experimentally. 'The retrieval algorithm applies the search 
operation; it may select subsets of documents to be searched, and it may 
conduct several search operations in response to one user request. . The 
details, of the retrieval operation are the primary concern ’of this study. 

One of the simplest retrieval algorithms.- here called the "full 

search" algorithm, performs one search operation using the entire document 

/ 

• * „ \ ✓ 

, * \ t 

collection, and selects for retrieval the highest n documents in the 

-i % \ 

\ . 

ranked list resulting from the search oparatibn.. In an operating system 

\ • 

each user could select this n; in the SMART system n is an experimental 
parameter. * 
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Chapter II 

User Interaction With An On-Line Retrieval System 
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The full search retrieval algorithm returns to ttie user the n 
documents with concept vectors "closest" to the query- lector as measured by 
the angle between vectors (cosine correlation). If the user's original 
query is an accurate and complete description (in "concepts") of his need, 
and if the documents relevant to the user "are' clustered "close" together in 

X 

the space of concept vectors, this algorithmic^ isolate these few N relevant 
items from a large collection of irrelevant material.; However, neither of 
these conditions is common in practice. It is evident from experiments 

with the SMART system that a user familiar with the subject area but Unaware 

\ •' < . 

of vocabulary and word frequency effects on the search process is unlikely 
to formulate an initial query that provides optimum retrieval [6]. It is 



unreasonable, however, to expect each user to understand the fine, details 

• ■ , ^ ' ■ • 

of the document classification system. . I 

Further, there is evidence in the experimental document collection. 

used here, that the documents judged relevant by the users are not always 

0 > 1 , 

clustered neatly in the concept vector space. Even with full knowledge 

of the document collection it is often impossible to formulate a single 



q ue ^y that will rank all relevant documents above all nonrelevant (docu- 
ments. This may indicate flaws in the text-to-vector mapping used for 

!V . ' : ( : 

this study. However, the needs of the human users of document • collections 
are so diverse that a subject classification system appropriate for all 

■ 

queries may not exist, or may be impractical to implement. 

' Since the user's original query is often inadequate, some sort of 
; user interaction with the retrieval operation is desirable. The user of a 






".I \ \ A, 
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manual retrieval system such as a library might at first ask a general and 
unclear question. The librarian,, using his knowledge of the document 
collection, might then ask the user a few questions and show him a few 

<x- “ 

books in an attempt to pinpoint his needs. Recent technological develop- 
ments encourage the investigation of similar types of user fee db ack in 



automatic retrieval systems. Large capacity random access memory devices 
allow the storage of natural language document titles and abstracts. On- ' 
low speed terminals and time— sharing techniques may be used to provide 

real time interaction with many users a.t once, at several convenient loca- 

* 

tions. 

Two major considerations arise in such an on-line system. In the 
present, batch-processing systems, such as NASA and Medlars*^ [2] immediate - 

' i 

response to the user is not necessary. In an interactive system the com- 
puter time required to process a single query takes oh a. new importance. 

The low input-output speeds of those terminals appropriate for inter- 

V ( ■*. 

active applications introduce a second limitation .. [7] . For example, typing 
out a single document abstract on a typewriter terminal could easily con- 



V 



sume more time than the computer retrieval operation. An interactive 



I. . 



document retrieval system therefore requires an efficient retrieval algo- 
rithm and af^nimum of necessary interactive input and output. 

Several methods'^of^ user interaction have been tested in the SMART 
system using the document collection empl^edT in this study (the ’Cranfielc 

' _ ’ * e 

200' collection described in Section IV) . Results of this investigation [6] 
are summarized below. 
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-The interactive Strategies tested can be divided into pre-search 
and post-search algorithms. In pre-search interaction, information is 



presented to the user and a new query is constructed by him before the 
search operation takes place. The "repeated concepts" algorithm asks the 
user to choose one or more of his query terms to be repeated for emphasis. 

The "word frequency" technique displays for the user the frequencies with 

- '''• v ' * 

which his query terms occur in the document collection. , The user is then^ 

invited to eliminate or change query terms that are too common or too 
to be useful for retrieval. Both of thpse displays help the uninformed 
user to take advantage of the effects of wo^d frequency in a retrieval 
system using frequency-weighted vectors for document classification. The 

"thesaurus display" supplies synonyms and terlns related to the terms of uhe 

■ / . . ■ . • ; 

initial query from a stored thesaurus a appropriate to the subject area. The 

/f ' • ' ' . 

thesaurus used for this display in reference 6 is the "regular thesaurus" 

described in Section I of this report. Since the same thesaurus can be ^ 
incorporated automatically into the SMART system, manual and automatic 
thesaurus procedures are compared in reference 6. The automatic applica- 
tion of the thesaurus to document and query vectors gives better retrieval 
results than the manual thesaurus display, except at low recall levels. 

The "source document display" exhibits concepts assigned to a relevant 
document known to the user before retrieval. When this display is used in 
addition to the automatic thesaurus, results are better than with automatic 



thesaurus alone. 



N 



Post-search techniques display the partial results of an initial 



search operation so that the user can reformulate his query and request 
another search. These algorithms may be iterated as often as the user 
desires. All post-search algorithms share a common disadvantage, the com- 
puter time required for several search operations. The time qs^ well spent, 
however, for all post-search techniques investigated give better retrieval 

c 

than automatic thesaurus display. "Title display" which displays the 

titles of the first n (in this reference n=5) , documents retrieved by the 

initial query, provides better retrieval than thesaurus display except at 

high recall. "Abstract display", which displays n full abstracts, requires 

more output time and more time for user thought, but gives consistently r 

better performance than title display. A variation of "relevance feedback", 

the technique investigated in this study, gives retrieval results nearly 
* * 
comparable to abstract display. Moreover, this report gives more effective 

variations of the relevance, feedback algorithm than the version used by 

. * 

Lesk and Salton.* When pre-search and post-search information is combined, 

manual thesaurus display followed fc by abstract display gives batter retrieval 

* 

than either method alone. Adding word frequency information to the combi- 
nation is helpful when the word stem thesaurus is used. 

Estimates of the search cost per query show that abstract display, 
which gives the .best overall performance of the methods tested, is the 
most expensive. The other post-search algorithms, title display and rele- 
vance* feedback, are more costly than any pre-search method. Relevance 
feedback requires the least user effort of arfy post-search strategy. Lesk 



* Lesk and Sal ton use the Q Strategy with N equal to 5. 
See Sections VI-C and VI-Dror more erf feet ivy algorithms, 
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a 



and Salton [6] recommend the following algorithms; 

\ 

a) For normal users needing high recall, automatic thesaurus 
followed by automatic relevance feedback. 

b) For highest precision when high recall is notJ required, 
word stem matching followed by' title display. 

c) For experienced and patient users needing maximum perfor- 
mance, thesaurus display plus frequency information followed 
by abstract display. 



The Lesk and Salton study shows that relevance feedback is one of 

the most effective usei: interaction techniques. In relevance feedback, 

the user is given a small set of items retrieved using his original query. 

V* 

He is then asked to judge which items of this set are relevant to his needs. 

* i 

This information is used to automatically produce a new query for another 
search. This feedback process can be iterated as often as desired. Rele- 
vance feedback has a definite psychological advantage over abstract display; 
the user is not required to make sophisticated decisions in rephrasing his 
own query. Instead, he can supply much information to the .retrieval system 
little effort by saying in effect "I want documents on the same subject 
as this document". The stored abstract of a chosen relevant document con- 
tains a more detailed description of the subject than a user would care to 
type as a query. In the experimental collection, the document vectors 
have approximately ten times as many concepts as the query vectors , so the 
user submits a ten times more detailed "query" simply by typing a document 
identifying number. 



I 
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The /disadvantages of relevance feedback should be reiterated. Like 

V 

abstract display, relevance feedback requires the system output of document 
abstracts or information of comparable detail. Also, multiple searches of 
the document collection are made. Designers of retrieval systems must 
decide whether the extra output time and computer time is justified by the 
retrieval improvements obtained. 

J 



\ 
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Chapter III 
Prior Investigations 

Of The Relevance Feedback Retrieval Algorithm 



Rocchio 18,9, 10 J suggests an algorithm for relevance feedback based 
on the properties of the distance function used. If the set of relevant 
documents is known, the query that will be "closest" to this set of docu- 
ments and furthest from the set of non-relevant documents can be formed. 

\ 

If the cosine correlation is used as a distance function, this ideal query 
is 




N 

r 



N 




/ i 2 

V(sV 




where each r 1 is the vector describing a document relevant to the user's 
query, and each s 1 is the vector describing a document not relevant. Thus 
N r is the number of documents in the collection that are relevant to the 
request, and N g is the number of non-relevant documents, or the remainder 

of the collection. ''if ' ' 

\ 

This ideal query is useless for retrieval, because if ihe docu- 
ments relevant to each request were known, a retrieval operation would not 
be needed. Rocchio suggests that t»* ideal query might be approached by 
iteration. The user is asked to mtuce relevance judgments on a small 
retrieved set of documents, and this set is used to update the former 

I 

query as follows: 



t 




n n q„. 
r s^r 





(A) 



where n and n are the nuinbers of 'relevant and non-relevant documents 

r s , * / 

retrieved by the previous search operation, 

Rocchio investigated relevance feedback using .formula A and the 

t * 

SMART retrieval system [9] * A set of seventeen natural language search 

requests and a collection of 405 abstracts of articles published in IRE 

• u 

Transactions on Electronic Computers (March- September, 1958) were indexed 
using a SMART regular thesaurus (Section II), Relevance judgments for 
the sample queries were constructed by a manual search of -the entire docu- 
ment collection. Average retrieval results for the collection described 
are improved by two iterations of the relevance feedback process described 
in formula A, Rocchio suggests construction of multiple queries when the 
documents desired are not clustered in the document vector space. 

Another investigation of a relevance feedback system was based on 

** ' . 

the 'ADI collection”, a collection of 82 documents presented at a confer- 
ence on documentation. Thirty-five queries were constructed for this 
collection, and the documents considered relevant to those requests were 
specified .by the two originators of the queries. The investigation of 
relevance feedback in the ADI collection was conducted by Riddle, Horwitz, 
and Dietz [11] . They used 22 of the 35 queries and studied a slightly 
different algorithm for modifying the search query. Their formula is; 



Q.+l 

i 




+ a 

* 





(E) 



Three differences from Rocchio' s formula are immediately apparent; 
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a) 



The descriptor vectors are not normalized by their length. 

In Rocchio's formula, the change to the weight of concept, 
a in the query depends not only on the weight assigned to 
concept a in a retrieved document vector but also on the 
length of that document vector; that is, on the number of 
other concepts and on the magnitudes of weights in the 
document vector. This is not the case in the Riddle, 

Horwitz, and Dietz formula. When the latter formula is 
used, for instance, a document with generally highly weighted 



generally lower weighted concepts, the number of concepts 
being equal. Weight magnitudes being roughly equal, a 
document with more concepts changes the query more than 
one with fewer. Rocchio's formula compensates for these 
effects. 

The parameter a, which is the one variable in the above 

formula, is constant for all queries. Rocchio's formula 

uses a different multiplier for each query; the multi- 

plier being dependent on the numbers of relevant and non- 

relevant documents retrieved (n and n ) . 

r s 

The non -re levant documents retrieved on the previous 
.iterations are not used to update the query. However, 

Riddle, Horwitz, and Dietz tested a "negative heuristic 
strategy" which uses the two non-re levant documents 
first retrieved (the two which the system falsely judges 
most relevant to the query) to update those queries that 
retrieve no further relevant documents on the first feedback 
■iteration. For such queries the formula becomes: / 




concepts changes the 4.\iary more than* does a document with 



n 



2 



r 





r . 
1 



1 



(C) 
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The feedback algorithm of Pdddle, Horwitz, and Dietz produces an 
improvement in performance on most of the queries tested. The three 
experimenters recommend that the variable a in their formula be set to 1 
for the first iteration and then increased by 1 for eachv subsequent 
iteration (called "increasing alpha strategy"). They also recommend 
their negative heuristic strategy (formula C) . 

Crawford and Keizer (12'] have tested a relevance feedback stra- 
tegy that ignores the original query after the initial search if a rele- 
vant document is found. Their algorithm is: 

* 

* 

\ 

If at least one relevant document is retrieved within the 
first n documents, the original query is ignored and one additional 
relevant document is used for each iteration: 



^i+1 r i 



But if no relevant documents are retrieved within the first n 



*2 " 2 1 - S 1 



On iterations after the first, the second formula of their strategy is not 



used. 



<\ 



Using the Cranf ield 200 document collection (Section IV) , Crawford 



and Melzer found their strategy superior to formula B when a = 1. 

Steinbuhler and Aleta 113] have tested Rocchio's algorithm in the 

\ 1 

ADI collection, for the 'worst < sa\ when only non-relevant documents are 
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available for feedback after the first search operation. The information 
from non-relevant documents alone, when used to modify the query according 
to formula. A, efives better retrieval than the initial query. 

Kelly [14] proposes an addition to the relevance feedback algo- 
rithm when nof'relovant documents are retrieved. He points out that in 
these cases no new concepts are added to the query, and recommends adding 
concepts that occur frequently in the document collection. He tested this 
recommendation on sets of artificially constructed 'query' and 'document' 
vectors with success. However, Steinbuhler and Aleta [13] found that 
adding frequent concepts to the query degraded performance in the ADI 
collection. 

The results reported in Section VI of. this study give further •*. 
insight into the problems investigated by the five earlier studies cited. 
Section VII uses both the earlier studies and the present report to 
support recommendations for document retrieval systems. 



) 
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Chapter IV^ 

Environment Of The Reported Experiments 

* 

■I 

The present study of relevance feedback compares several related 

t! * 

fprmulas for query modification. The following general formula is avail- 
able to the experimenter: 



min(n , n ) 
a r 



mintn^, n g ) 



Q.+l = tt 0. + u)Q + a V r. + y \ 

z_ 1 2_ 



s . 

1 



(D) 



where n + n (see formula A, Section 1) equals N, the number of documents 
retrieved for feedback. 

. The experimental variables are a, w, tt, y , n ,. n , and N. The para- 

meter a is positive, and weights all incoming relevant documents relative 

to other contributors to the query (previous query, initial query, non- 

relevant documents). The parameter tt permits the previous query to be 

increased in weight relative to the incoming documents. Q is the initial 

o . 

..query , as opposed to the query of the previous iteration; u) permits the 
‘'initial query to be used as part of the new query (see Section 3B) . The 
parameter y should theoretically be negative, as it permits some signifi- 
cance to be attached to the non-relevant documents retrieved. The para- 
meter n^ (n^) permits some specific number of relevant (non-relevant) 

documents to be used in the query even if n (n ) is larger. It is 

r s 

assumed that the r^ and s^ are indexed in order of decreasing relevance 

(as determined by the system) to the query; that is, the n relevant 

a 

t i ^ 

documents (or n^ non-relevant documents) used in the new query will be 
those closest in the descriptor space to the previous query. The flexi- 



■ r 
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bility of this formula permits the investigation of several feedback 
strategies. . . ' ■ ' 



The system also provides the following formula to simulate Rocchio's 
algorithm: \ 

' \ . ■ 



Formula E does not normalize the vecton lengths as is done in Rocchio's 
algorithm'- (formula A). 

The document collection used in this study (the "Cranfield" 
collection) contains 200 documents from the field of aerodynamics, chosen 



constructed by some of, the authors of the 1400 documents; these requestors 

I . 

are also responsible for the relevance’ judgments. 

The concept vectors describing document and queries are' quite 
sparse for the "Cranfield" collection. The maximum number of concepts 
used to describe one document is 85,, out of a possible 552 concepts. The 
largest weight given to any concept in any document descriptor is 288. 

The query description vectors are sparser by one order of magnitude and 



shorter than the document descriptors. The maximum number of concepts 



is 24. The largest number of documents relevant to a single query is 12, 



V min(n , n ) min(n , n. ) 
r a s b 




(E) 



from a library Of 1400 documents. For this* collection, there are 42 queries 



used in a single query vector is 13; the largest weight in any query vector 



or six percent of the collection. The comparative brevity of the query 
vectors in this collection is typical in technical document /retrieval 




W rnaiiti -nail H 
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becauBe document abstracts generally contain more detailed information than 

v i ' ' 

user queries. 



The characteristics of an experimental document collection determine 
i extent to which experimental results typify retrieval behavior in 

.document col lect ions .'^The three collections described in this study 
are much too small to require sophisticated retrieval techniques. However, ; 

* ■ , j. 

the Cranfield 200 document collection is more realistic than the ADI and 
IRE collections for three reasons. ! 

First, more, queries are available for the Cranfield 200 collection J./ 
The number of queries available is important in judging the statistical / 

i 

significance of experimental results. 

. . ' ->A 

Second, the documents in t.he Cranfield 200 collection were chosen 

from a more typical environment. The ADI collection consists of short 

- r I 

papers all presented at the same conference. The papers in the IRE 
collection/ were all published in the same magazine within a seven-month 

i ■ 

( • . 

period. By contrast, the 1400 documents in the full Cranfield collection 

I •• . ■ 

were in effect selected by knowledgable authors from the field of aero- 



dynamics, and the 200 documents in the small collection were chosen to • 
represent the larger collection. 

Third, the queries and relevance judgments in .the ADI and IRE 



collections were constructed by a small number of -information retrieval 

experts, while those in the full Cranfield collection were constructed by 
182 authors of recent papers in aerodynamics. 



Concept vectors for both the ADI and Cranfield collections 





were 



V-l 



Chapter V 



, Evaluation Of Retrieval Performance 



/ 



A. Measures of Performance 

Several average measures of the performance of the tested retrieval 



algorithms on the 42 Cranfield queries are' used in this report. Each 



.an information retrieval system, an arbitrary cut-off point, such as rank 



ten or cosine correlation 0.75, is often employed. Documents above this 
cut-off point in the ranked list resulting from a search operation are 

4 % 

considered, "retrieved”. With such a cut-off, recall is the precentage of 



percentage of retrieved documents that are relevant. 

An ideal retrieval system would provide recall and precision of 
100%, indicating that all relevant documents are retrieved and no non- 
re levant documents are retrieved. In SMART experiments an inverse rela- 
tionship between recall and precision is observed, such that high recall 
implies low precision and vice-versa. 

The 'document curves' used in this report are graphs of recall 

and precision at several cut-off points based on rank; that is, recall 

% 

and precision after x documents are retrieved, for several values of x. ' 



The other measures used are not based on specific cut-off points, but in 



a sense measure retrieval performance over the entire document collection. 



proposed by Rocchio 19] t that take the average recall and precision 



4 ments in the collection, R. is the recall at a cut-off of j documents 



measure is based on the concept of "recall" and "precision". In evaluating 



documents re+svant to the user that are retrieved, and precision is the 




Normalized recall and normalized precision are two measures 



obtained for all possible cut-off points. If N is the number of docu- 



• 3 



y-2 



/* 



(rank j) and P.. is the precision at a cut-off of j documents, normalized 
recall and precision are defined as follows [15] : 




NP 




N 

L 

j * i 



p. 






For automatic calculation, the following approximations are used 
iri the SMART system [15] : 



nr = i - E r i' I * 

i = 1 i = 1 

n (N-n) 



o 

KLC 



n 



n 



NP = 1 



- n inr i- r 



In i 



i = 1 



i = 1 



In 



- 



N1 



! (N-n) I 



where r is the rank of the i relevant document in the collection and n 
i 



is the number of relevant documents in the collection for the given query. 



1 



A normal overall measure of retrieval performance has been suggested [15] . 
but is| not explicitly displayed in this report: Normal overall measure — 

I * 

1 - 5 tfJR + NP. The factor of 5 gives equal weight to the two component 

\ | • 

measures. 



mm 









Another overall measure used in many studies of retrieval' perfor- 
mance is the recall-precision curve, an average plot of precision at each 
5% or 10% of recall. Each query is averaged into each point of the plot. 

To accomplish this averaging process, an interpolation procedure is needed, 
since, for example, a query with two relevant documents can only achieve 

uninterpolated recall levels of 50% and 100%. , 

Two types of recall-precision curve are used in this study. They 
are distinguished by the method of interpolation used. Both the Quasi- 
Cleverdon interpolation used in several previous studies and the Neo- 

Cleverdon interpolation now used for all evaluation of the SMART system 

/ * , 

are described below. 

\ 

Figures 1 and 2 show two graphs for a hypothetical query having 4 
relevant documents. The relevant documents are assumed to be retrieved 

, i 

with ranks of 4, 6, 12 and 20. Thus, at 25% recall, the precision is 25%, 
a SO%-recall, the precision is 33%, and so on. However, these values 
correspond actually to the highest possible precision points, since they 
are calculated just after a relevant document is retrieved. In this 
example, after 3 documents are retrieved, the precision is 0%, after 5 
documents, the precision is 20%, and so on. This range of precision for 
each recall level is indicated by the top and bottom points in Figures 1 
and 2 at 25%, 50%, 75%, and 100% recall. The zo2 id sawtooth line connecting 
these points is not used for interpolation? it is intended to indicate the 
drop in precision between the actual recall levels for this query as more 
non-relevant documents are retrieved. 



ai mh 
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The Quasi Cleverdon interpolation uses a straight line between peak 
points of precision, as indicated by the dashed line in Figure 1. It has 
been argued that this interpolation is artificially high, since it lies at 
all points above the saw-tooth curve, and thus, does not reflect in any 
way the precision drop as more non-relevant documents are retrieved. The 
Neo-Cleverdon interpolation of Figure 2 projects a horizontal line leftward 
from each peak point of precision, and stops when a higher poi ^t of pre- 

%r • 

cision is encountered. This new interpolation curve (the dashed line 
jn Figure 2) does not lie above the saw.-tooth curve at all points. When 
the precision drops from one recall level actually achieved to the next, 

an immediate drop in precision after the first point to the level of the 

•\ 

next point is indicated. For example, in Figure 2, the precision value 
at 50% recall is 33%, but at 55% recall, the interpolated value used for 
the new averages is 25% precision. When the precision rises from%ne 
recall level to the next, however >^the first precision point actually 
achieved is ignored for purposes of interpolation. The achieved pre- 
cision of 25% at 25% recall in the example of Figure 2 is ignored, and 
for all recall levels from 0 to 50%, an interpolated precision of 33% 
is used for the new averages. The proponents of the new interpolation 
argue that this method indicates in all cases a precision that the user 
could actually achieve, if he were to use clairvoyance to retrieve 
exactly the right number of documents. 




B. Statistical Significance Tests 

Several statistical tests are reported here using as input the 










rank recall, log precision, normalized recall, normalized precision and 10 

points from the feedback effect (Section V-C) recall-precision curve with 

v 

the Neo-Cleverdon interpolation. The statistical tests are intended to 
measure the "significance'' of the average difference in values of these 
measures obtained for two iterations or two distinct search algorithms. 

The test results are expressed as the probability that the two sets of 
values obtained from two separate runs are actually drawn from samples 
which have the same characteristics. A small probability value thus 
indicates that the two curves are significantly different. If this pro- 
bability for one measure is, for example, 5%, the difference in the two . 
average values of that measure is said to be "significant at the 5% level". 

Choice of a -statistical method for calculating this probability 
is important. The present study uses three statistical tests, the familiar 
T-test, the Wilcoxon Signed-Rank Test (WSR) , and the Wilcoxon Rank Sum 
Test (WHS) [16]. 

/ 

The T-Test and the Wilcoxon Signed-Rank Test are used in this 
report to compare the retrieval of one feedback iteration to another or 
of one algorithm to another, using all queries. The T-Test takes account 
of the magnitude of the differences, and assumes that the measures tested 
are normally distributed. The WSR test does not make this assumption. 

Moreover, the WSR test takias account only of ..the ranks of the differences, 

/ 

ignoring their magnitude. Because this test does not assume normality 
of the input and because it ignores some information (magnitudes of 
differences) , the WSR test is more conservative than the T-Test. 



It is 
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therefore less prone to the error of calling a result "significant" when 
it is not. Because information retrieval provides discrete rather than 
continuous data, and because only 42 data points (42 queries) are provided, 
the more conservative WSR test is preferable for the present evaluation. 

The Wilcoxon Rank Sum Test can be used to test unpaired observa- 
tions, and is used in Section VI-E of this study to compare one subgroup 
chosen from the 42 queries to a contrasting subgroup of queries. Like 
the WSR test, the WRS test ignores the magnitudes of the results and does 
not assume a normal distribution. i 



C, The FeedbacJ^ Effect in Evaluation 

The assignment of ranks to documents retrieved for feedback is a 
key factor in the evaluation of retrieval performance. Two methods of 
assigning these ranks have been proposed, and both are used in the present ' 
study. Hall and Weiderman [17] compare and evaluate these two methods. 

In previous feedback investigations, all documents in the collection 
received new ranks after each iteration and the top-ranked N documents were 
used for feedback. Hall and Weiderman point out that evaluation of this 
retrieval technique takes into account two effects, which they call "ranking 
effect" and "feedback effect". 

V ' 

\ , 

Relevance feedback in effect uses information from one or more 

document descriptors to modify the query descriptor.' “The relevant docu- 

\ 

* \ 

ments used for this purpose will be ranked higher by the modified query 

. I 

(f # \ i 

than previously, and t^e non-relevant documents used will i>e ranked lower. 
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The effect of these rank changes in "retrieved" documents is termed the 
"ranking effect", if the ranking effect is included in an overall perfor- 
mance measure, the measured change in performance between feedback itera- 
tions is 'quite impressive. 

This large change in "total performance" (including both ranking 
and feedback effect) indicates the extent to which the initial query has 
been perturbed toward the centroid of the relevant documents, and strongly 

J 

supports Rocchio's theory. v 

Hall and Weiderman state that in an environment where the user must 
actively supply relevance judgments for feedback, changes in the ranks of 
documents which the user has already seen are of no interest to him. The 
user in such an environment is concerned primarily with the "feedback 
effect"; that is, the effectiveness of the modified query in bringing new 
relevant documents to his attention. They conclude that, though total 



performance is a valid measure of the effectiveness of relevance feedback 



in approaching ti*e " ide al,^ery^^the feedback effect should be isolated 



and examined as well. 



The present study evaluates total performance and also measures 
feedback performance in the manner suggested by Hall and Weiderman, dis- 
carding the ranking effect and presenting only the feedback effect. The 
ranks of the top N documents retrieved in each iteration (the documents 



used. for feedback) are "frozen" in all subsequent iterations, and only 
the remainder of the collection is searched using the modified query. 
Thus, in feedback effect evaluation, the N documents retrieved on any 



iteration are guaranteed to be N new documents? that is, documents not 

♦ 

used for feedback on any previous iteration. Moreover, the performance 
measures for the first (second, third) iteration are calculated from a 

ranked document list in which -the top N <2N, 3N) documents are the same 

0 \ * * 

as those retrieved previously. Only the changes in the ranks of documents 
not yet seen by the user is measured. 

Feedback effect evaluation gives overall results that are decep- 
tively low. Because the top ranks are frozen, no newly retrieved docu- 
ment can achieve a rank higher than that of any previously retrieved docu- 
ment. with a constant feedback strategy, therefore* on the first (second, 
third) iteration, the highest possible rank for a new document is N+l (2N+1 
3N+1) . For this reason, the feedback effect evaluation is a misleading 
measure of the overall performance of the retrieval system, and should be 
used in conjunction with other evaluation methods. Isolation of the feed- 
back effect i3 primarily useful to compare different feedback strategies 
from the viewpoint of a user. in an interactive retrieval environment. Fig- 
ure 35 in Section VI I-B compares total performance and feedback effect 

evaluation of similar feedback algorithms . 

, *• 

However, one feature of feedback effect evaluation is psycho- 
logically essential to a realistic relevance feedback system; the guar- 
antee that the N documents retrieved on any iteration have not previously 
been seen by the user. For this reason,, ne;/ evaluation methods that , 

f ... 

provide this guarantee without severely limiting the attainable .retrieval 
performance should be investigated. Several such methods are discussed 



in Section VII-B. 

The results reported in this- study include: 

Total Performance: 

- 1. Normalized recall and precision. 

2. Recall-precision curves with Quasi -Cleverdon interpolation. 

Feedback Effect: 

1# Normalized recall and precision. 

2. Recall-precision curves with Neo-Cleverdon interpolation. 

3. Document curves at several cut-off points. 

4. T-tests and Wilcoxon Sign'ed Rank tests of the normalized 

measures and of recall-precision curves, „ - * ' 

5. Wilcoxon Rank Sum tests of normalized recall and precision. 

i *' 
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Chapter VI 



Experimental Results 



The resu lts this study are presented in five general sections: 



a) A comparison of the improvement in retrieval performance' 
observed for the Cranfield 200 document dollection with 
uhat obtained for the ADI 82 document collection used 
by Riddle > Horwitz, and Dietz [11). 

b) ^ investigation of strategies that use only R' , the set 
of relevant documents retrieved, to update the query. 

The different algorithms are obtained by varying the 
parameters tt, to, and ot in the query-update formula. 

1 increasing alpha strategy !l of Riddle, Horwitz, and 

Dietz is included among the methods tested. 

* 

c) An investigation of the effect of the number of docu- 
ments given to the user for feedback on each iteration. 

i 

d) An investigation of strategies that use. both relevant 
a - ad non- re levant documents retrieved to update the query. 

e) An investigation of the retrieval characteristics of 
selected subgroups of queries. 

A. Comparison of the Cranfield and ADI Collections 

\ " * * 

The initial search results, before feedback, for the two collections 

are essentially the same except at the ends of the recall-precision curves. 
Below 30%' recall, the precision of the ADI initial search is from 2 to 7% ‘ 
better than that of the Cranfield initial search. Above 80% recall the 




precision in the Cranfield initial search is from 2 to 6% better. 



This result is interesting because there is reason to expect that 
performance in the Cranfield collection would be worse. Cleverdon and Keen 
point out that in a collection with a Higher "generality number", that is, 
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with a higher ratio of relevant documents to collection size, performance 
is better with respect to precision [IS] . The average'generality number 
of the ADI collection is over twice that of the Cranfield collection. The 

generality number in a collection of practical size would be even lower 
than that of the Cranfield collection. 

Because the initial search results differ, the total performance 
improvement caused by feedback in the recall-precision curve is used for 
comparison of the two collections. All thirty-five queries are used to 
search the ADI collection. The "increasing alpha strategy" of Riddle, 

Horwitz, and Dietz is the update form&Xa, and five documents are given the. 
user on each iteration. 

Figure 3 shows the differences in total performance precision for 
all recall levels between the initial search and the first a^ second 
feedback iterations for eich collection. In the Cranfield collection, 
relevance feedback causes greater improvement that in the ADI collection. 
Also, the second iteration results in a greater improvement over the first 
in the 200 document collection. ,The difference in generality .between the 
collection would be expected to cause less improvement in the ; larger 
collection [18]. The greater effect of relevance feedback in the Cran- 
field collection could be due to any or all of the following factors: 



a) 



r 



The difference in subject and in the language of the 
subject. It is possible that the terminology of aerody- 
namics is harder' , that is, more limited and prebise, 
than the vocabulary of the newer field of computer science 
^Retrieval of documents from a harder subject* area would 
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"increasing alpha strategy" 



C ADI 

A Cranfield 



1st iteration 
2nd iteration 




ADI and Cranfield Collections 



Total Performance Recall-Preci’sion Curves 



Figure 3 
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expected to be better. v 

* 

b) The difference in collection scope. The ADI collection 
covers a wider subject area within computer science than 
j\ does the Cranfield collection within aerodynamics. A 



c) The difference in variability within the collections. 

The 200 documents were chosen from 1400 documents concerned 
with aerodynamics. The 32 document collection consisted ^f 
short papers presented at a single conference. Since the 
i Cranfield 200 documents vary more in such parameters as- 

a t* 

f vector length and terminology, relevant documents might 



judgments mentioned in Section III. .It is encouraging to 
find that in the more realistic Cranfield environment, 
relevance feedback causes more rather than less improvement 
in performance. 

B. Strategies Using Relevant Documents Only 

Two of the experiments of Riddle, Horwitz, and Dietz' [11] are 



repeated for the Cranfield collection. To simulate \thqir experiments 



equal to 1, and \i and u> equal to 0. Both the "increasing alpha" and 



"constant alpha*' strategies- are employed. 

Figure 4 clarifies the effect of the "increasing alpha strategy" 
and the "constant alpha strategy" for the first, seconcb, and third 



column shows the factors which multiply a relevant document retrieved on 



! narrower subject area should provide better retrieval. 



be easier to distinguish from non-relevant documents. 



d) The difference in query construction and relevance 




with equation D of Section IV, the parameter a is varied; tt is kept 



iterations of a feedback run, evaluating total performance. The R 



0 



\ 
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the initial search, R 1 shows the multipliers affecting a relevant document 
which is not retrieved until the first iteration, and R 2 shows the multi- 
pliers affecting a document not retrieved until the second iteration. 

Figure 4 assumes that a document once retrieved is retrieved on all suc- 
ceeding iterations; in the experimental system this assumption is generally 
correct. 

It is clear that both the constant and i pcrea arhg alpha strategies 
give a document retrieved on an earlier iteration more significance in 
later queries. On the third iteration, the constant alpha strategy assigns 
to a document retrieved on the initial search three times the significance 
it gives a second iteration document (the respective multipliers are 3 and 
1) . The increasing alpha strategy assigns to an initial search document 
twice the significance of a second iteration doc4nent (the respective mul- 
tipliers are 6 and 3) . This effect stems from the use of the previous 
query as an element in the equation. To assign the same significance 

to relevant documents whenever they are retrieved, it is necessary to sub- 
stitute Q q for. Q i in the formula: that is, to let tt = 0 and u) = 1 in 
equation B. This is called the "Q^ strategy" in Figure 2. 

Riddle, Horwitz, and Dietz (11) report that for the 82 document 
collection, the "increasing alpha strategy" performs somewhat better than 
the constant alpha strategy. In the Cranfield ‘collection, the three stra- 
tegies shown in Figure 2 give essentially the same results when N = 5. 

Using the Q q strategy with different relative values of u and a, also does 
not change performance. Query update parameters (in equation D) for the 

, • / 

\ ' 



m 



HIM HUP PH 




six experiments performed are shown in Figure 5. Among all six experiments, 

the differences in normalized precision and recall are less than 0,75% for 
all iterations. 

In total performance, six strategies using only relevant documents ' 
differ very little. Three additional ’relevant only' algorithms are com- 
pared using feedback effect evaluation. One of these strategiesf sets a and 
it .equal to 1 in formula D. This strategy, called Feedback Increment, is 
not equivalent to the constant alpha strategy because the feedback effect 
evaluation provides new documents for feedback on each iteration. Figure 
5 shows that the Feedback Increment strategy gives the same weighting 
effects as does the strategy. These two strategies are identical on 

the first iteration, but on subsequent iterations the feedback effect 
0 

evaluation may retrieve different documents. 

Another strategy using feedback effect evaluation, called 0 + , 

o 

gives added weight to the original query on each iteration by setting cu 

/ 

equal to 4 (a=l , tt=1). a third strategy is Rocchio+, the Rocchio stra- 
tegy without non-relevant documents. In effect, a equals n n and tt 

r s 

equals n g , so a and tt vary with each query. 

Differences in feedback effect among these three methods are 
trivial. For the two overall measures and the recall— precision curves, 
the largest difference is 1.25 percent. The document curves are more 
sensitive in general to performance differences, especially in recall. 

The largest difference is 3% in recall at a 40 document cut-off. Most 

* . t 

differences in all measures favor the Q + strategy. 

» o ■ 
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The 200, document collection seems quite- insensitive to variations 
in the parameters tt, w, and ot. The considerations mentioned in section VI-A 
are probably relevant here also. This insensitivity indicates that per- 
haps the performance for the Cranfield collection is more stable in 
general than for the ADI collection 8 Evidence of comparative stability 
is also reported by Lesk and Salton [19] . The performance differences 
between automatic use of the word stem thesaurus and a regular subject- 
area thesaurus (see Section II) are less pronounced in the Cranfield 200 
collection than in the ADI collection; 

It is evident from the reported experiments that the weight 
assigned to the original query has little effect on retrieval. This 
finding tends to support the conclusion of Crawford and Melzer [12] 
that the original query is not needed after the initial search (Section 

III). The advantage of their strategy over equation B is probably not 

; ! 

caused by the omission of the initial query when relevant documents are 
found, but by the non-relevant document feedback used when no relevant 
documents are found (see Section VI-D) . 

C. Amount of Feedback Output 

The number of documents fed to the user is a critical parameter 
in a relevance feedback system. Of course, performance improves when the 

i 

user supplies more information. This improvement must be evaluated in 
terms of the extra effort required of the user. 

* t 

Figure 6 shows the total performance of the "increasing alpha 



strategy" when 5, 10, and 15 documents are fed to the user for relevance 
judgments. The total performance improvement between the N = 5 and N = 10 
curves might justify doubling the number of relevance judgments the user 
must make; that is, a hypothetical "average" user might be willing to 
double his effort to -achieve such an improvement. Tripling the feedback 
to produce the N = 15 curve might not be justified by total performance, 
especially at the high recall end of the curve. 

Caution is necessary in interpreting the feedback effect eval- 
uation when N is varied, because the feedback effect evaluation gives an 
unfair advantage to runs using few documents for feedback. When five 
documents are used for feedback, ranks 1-5 are frozen on the first itera- 
tion and ranks 1-10 on the second (see Section V-C). When ten documents 
are fed back, however, ranks 1-10 are frozen on the first iteration and 
ranks 1-20 on the second. The difference in results caused by increasing 
the number of documents fed back is therefore minimized by the feedback 
effect process. 

Recall-precision curves for the feedback effect are not presented 

in this section, because the minimizing effect described above is averaged 

* 

3 

into different recall levels for different queries. For example, assume 
that query a has four relevant documents and query b has two. Each query 

i '-* 4 ' 

retrieves the first relevant document *;lth rank 8. When five d ocuments 
are used for feedback, each retrieves the first relevant document with 

i 

rank 6. When ten documents are used, of course rank 8 is 'frozen' by the 
feedback effect evaluation. Consider the effect of these queries on the 
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Varying the Number of Feedback Documents 
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; Total Performance Recall-Precision Curves 

Figure 6 
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recall-precision averages for the first iteration. When ten documents 

, J 

are used for feedback, neither query improves these averages. When five , 

are used, query a improves the recall-precision averages f^om 5% to 25% 

* “ - / 

recall, but query b improves the recall-precision averages from 5% to 50% 
recall. » - 



Thus it is hard to judge the significance of any difference in 
results caused by differences in feedback output. Figure 7 shows the docu- 
ment curves for two iterations of feedback with N equal to 5 , "using the 
Feedback Increment algorithm. Figure 8 uses these curves as a norm for 

i *, • 

i 

comparison with the results of less feedback and of more feedback. The 

* 

document curves of Figure 7 are represented in Figure 8 by the straight 
line at zero difference. The differences between N=10 and NN5 for two 
iterations and between N=2 and N=5 for one iteration are graphed. The 
1^*2 differences, indicated by , are positive at first because ranks 
3-5 are frozen for the N=5 curve. This initial advantage fades after 
10 documents and the N=2 results are lower than the N*=5 norm thereafter. 
The N=10 curves for both iterations are affected by the feedback effect 
evaluation. The first iteration gains higher performance than N=5 after 
13 documents. The second iteration curves cross the N=5 norm after 15 
documents, even with ‘ranks 1-20 frozen. After 40 documents have been . 

retrieved, the differences in both recall and precision for the first 

iteration are slight; the N=10 advantage oh the second iteration is slight 
but consistent. - 

After 20% of the collection has been searched, the differences in 
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feedback effect observed in Figure 8 are quite low. However, the marked, 

r > 

improvement in early retrieval caused by additional feedback might justify 

the additional user effort and system output required, especially if fur- 

,, * % * 

ther feedback iterations are desired. Moreover, it is important to note 
that certain users get no. benefit from any feedback strategy using only 

t 

' n ** # 

relevant documents. These are ^he users who find no relevant in the first 
N documents retrieved. For N-5,l 10, and 15, the number of queries retrie- 



ving no relevant on the initial search is given in Figure, 6, in the table 
below the graph. This table probably explains much of the performance 

» .t 

difference among .the three strategies. Eleven queries in the N=5 case 

produce the same low performance on the initial search, first iteration, 

* 

and second .iteration. These low results are averaged into all the N = 5 
curves. The N=10 curve is pulled down by only five such queries, and the 
N=15 curve by only two. In the N=5 case, one quarter of the users are 
not assisted by the chosen feedback strategy , a large proportion for a 
practical retrieval system., ' For these unlucky users, feedback of more 

documents is worth the effort. j 

* * • * 

v, 

' ’A variable feedback, strategy is here proposed which might save 



S 



effort to the average, user and give better service for more effort to the 

\ ; 

user Who does not find a relevant document early in the initial search. 

, * V , 

Each user is fed retrieved documents until he finds one relevant document 

* 

» J * 

-that he hasn't seen on any previous iteration. The relevant document found 
is immediately used ito produce a new. query. The success of this strategy. 

f 0 r 

defends * on the ability of a single relevant document to improve the 
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retrieval performance. 
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Figure 9 show$>€he total performance results of two iterations 
where the "user" is instructed to searck the retrieved documents until 

P \ 

* 

8 

he finds one new relevant document or until he has seen 15 documents. The 

i f 

curve in Figure 9 shows what happens when the user is instructed to 
find two new relevant documents. Only one iteration of the latter scheme 
was run because several queries do not have » four or more relevant docu-. 
ments. . 

* * o 

The first iteration feeding- back one relevant document- begihs near . 

i * S'*' 

the N=^5^curve of Figure 6 but by 50% recall has dropped near the f itst 

iteration N=5 curve, which has been superimposed on Figure 9. The table , 

1 * 

** % 

% 

below the graph shows that the "average" user* had to scan only four docu- 

\ • . 

ments for feedback in order to achieve the performance displayed in the 

first iteration "o" curve. By contrast, each user looked at exactly 5 

1 J 

documents to produce the first iteration N=5 cun/e. The first iteration 

5 , •• * v. •' 

strategy of feeding; back only one relevant document gives equal or better 

» 

j , 1 4 • 

performance for less average effort. N 

The second iteration "o" curve requiredvthe average user to 

_ J) 

search 5.9 more documents, or a total of/ 9.9 documents. This curve drops 
» ¥ ~ ^ , * 
below the first iteration N=10 curve (10 u documents scanned) at roughly 55% 

’ * % ; j 

* ' « - ' * \ 

recail (the first iteration N=10 curve from Figure 6 is superimposed oh? 

' '• * > ' * * , .. 

‘ • - L s , 

Figure 9). The user desiring high; precision and who may be less interested 

V * # * * 

in high recall might be wise to feed back one relevant d° cument f° r each 
of two iterations. However, the user needing higher recall should instead 
look at ten documents retrieved on the initial search. (These statements 






w— — • initial saarch , O feedback 1 new relevant^ * 

— 1st iteration & feedback 2 new relevant * 

* 

— - — 2nd iteration' ‘ “ , * 

% * 

?- * 4* 

(nc character) superimposed curves from Figure 6,, 1st iter. 
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apply to the , t "average" user). It is also seen from the table ''below the 

A * t 

* * 

graph that for the second iteration "o" curve one quarter of the users 

• & 

t * , . ' 

cannot find a new. relevant document, and thus after the first iteration 

« 

£.4 % * 

these users search 13 documents to no avail. Such performance would be 

quite annoying in practice. , 

» u 

• i 

% * • 

The average user who searches for two relevant documents in the 

* $ 

initial output looks at 7 documents. His recall-precision ("£y") drops 

below the first iteration N s 10 curve at 65% recall. Although 9 out of 42 

/ users do not find two relevant documents in the first 15, all but two of 

/ them find one relevant to feed back to the system. 

The user who feeds back two relevant documents on one iteration 
« ^ # 

("£") achieves better performance than does the user who feeds back one 

i - * * 

relevant document on each of two iterations (second iteration "o"). This 
result shows that the second relevant document retrieved on the initial 
search is more valuable for feedback than the first new relevant retrieved 1 
on the first iteration by the total performance retrieval method. Feeding 
back one relevant document on the first iteration evidently pushes down 
some relevant documents that are valuable for retrieval. .This finding pro- 
vides a strpng argument against the proposed variable feedback strategy, 
at least where higH recall is desired. Perhaps some sort of combination 
strategy might be optimal; for instance, the user could be instructed to 
feed back all relevant documents injihe first five retrieved, but if none 
are found in the first five to keep looking and feed back the first reie- 



Y 



vant document found, x 
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One iteration of feedback effect performance of variable feedback 

t * 

• s * • , 

<1 • 

arid of the combined strategy described above is presented in Figure 10. 

% 9 • * * 

I * f 

The differences between variable feedback (feeding b^ck one relevant). and 
constant N=5 feedback are graphed (o). After nine documents have been 

retrieved the feedback advantage using., the first relevant document -for 

1 . 

feedback is evident. The combination strategy (graphed £) that retrieves 

* 1 '<v * ' 

\at least- five documents shows a greater improvement over N+5 than does * 
variable feedback. Thi'p result shows that this variable feedback algorithm 

is advantageous only for those queries that retrieve no relevant among 

1 « • 

For those that do retrieve relevant within the first 

„ / 

0 t * .. 1 

five documents, the constant N»5 strategy gives better feedback effect 

• i 

* 

V, v 

results. Figure 9 shows that the combination strategy requires the average 
user tojlook.at 6.4 documents, as compared to the ,4.0 documents retrieved 
by variable feedback. • ' <- - . — .... 

N. i , 

— -^.Total performance and feedback effect results support three con- 

I - • -v. , ’ 

elusions about feedback strategies that use only relevant documents: 

< . • . ’ • 

1) Retrieving more documents does improve both types of 

performance (except where the rank-freezing of feedback 

* 

r • * 

effect evaluation prevents any improvement). Further, 

notable improvement cam be obtained by searching further 
if.no relevant documents are retrieved in the first N. 



2) The total performance for 5, 10, and 15 documents indi- 

" cates that when N is constant for all queries, the average 

# * s 

increment of improvement obtained tends to become smaller 

as more documents are used for feedback. 

* 3 

3) However, a comparison of the variable feedback and com- 
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. * * » ; i v • ^ 

bination strategies show that the first relevant.’ document re-^ 
retrieved generally does not contain enough information for ^re- 
trieval, and that documents retrieved soon after the first 
(N-5) can add useful information. In their study cited in 
Section III, Crawford and Melzer used onljr one 'very relevant 
document for retrieval. The finding of £his study Vould in- 
dicate that if several documents are almost equally relevant, 
all of them should be used. * 



j ' * 

| TtJese three conclusions lead to a recommendation that N be set to 

4 * \ 

v some value that most users would consider reasonable, but that for some 

v ' w 

queries N should be raised until at least one relevant document is retrieved 
This recommendation endorses the "combination strategy" for a retrieval 

system using only relevant documents « 

/ . 

y 

D. Strategies Using Non-Relevant Documents 

Rocchio's update formula (equation A) considers the information 
contained in the set of non-relevant documents- retrieved (S) to be as im- 
* portant as that . contained in the set of relevant documents retrieved (R) . 

If this is the case, the strategies so far examined disregard half the 
available feedback information. Further, information from non-relevant 
documents retrieved on the initial search might help those, users who 

ji * 

retrieve no relevant documents on the initial search (see Section III 

* ■ * . o 

and reference 6). Figure 6 shows that there are eleven such users out 
of 42 when N equals 5. 

[ However, problems arise in using the non-relevant documents in the 

SMART experimental system. There is no provision for negative weights in 

/ . - '• \ 

- • , / ' . ■ . 

. \ 

■ ' ; . . ' /"■ : : 



o 





the query vector. Also, queries and documents cannot be normalized to the 

* , , 

same lengch* for query updating. There is some danger, therefore, chat the 

- , / 

query will be reduced to nea.rly the zero vector when the documents of S 



are subtracted from it. Riddle, Horwitz, and 



Dietz [11] try to avoid this 



danger in their "negative heuristic strategy" by feeding back only the 
•first two non-relevaht^documents retrieved. 

The Rocchio strategy adjusts the multipliers for each q^ry bo as 

I t . 

__ original query, the sum of the j relevant, and the sum of the 



non-relevant equally, and uses all retrieved 



• \ 

This study compares the Rocchio and 



documents. 

'negative heuristic' strategies 

# 

measure. All comparisons are 



the Q q strategy (see Section 



using total performance and feedback effect 

/ ■ 

made with N equal to 5. Figure 11 compares 

VI i. B) with a strategy called 'Dec Hi', thajt decrements each, query by 
subtracting from it the first retrieved norj-re levant document. In the 

queryjupdate formula (eouktion D) , the parameter values and effective 

• * ; ‘ */ . ‘ ■! ' '• ' . 

update formulas for these strategies are: j , 



d- 



Q q : 71=0, u)=l, a=l, U=0, n a =N, n^=0 



.N 



^i+1 



■vl-. 









Dec Hi: tt=0, u)=1, a=l, U“-l, n a *N, n fa =l 



N 



^i+1 



- Q i + Iv- S i 



Figure 11 shows that the average results are consistently better for the 
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"dec hi' strategy, especially on the second iteration.. 

•? • / ’ 

For this experiment^ the implications of the total performance 

normalized precision and recall^ results given in Figure 12 seem inconsistent 

withN^those of the recall-precisim curves of Figure 11. On >the first iter* 

ation, the recall-precision curve for the dec hi strategy is above that for 

, the normalized recall fqr the first 

iteration is lower for dec hi, although precision 'is one percent higher. 



0 at all recall levels. However 
o 



(On the second iteration, the normalized recalls are the same, and the 

I i - . 

J i • ' * * 

normalized precision for dec hi is five percent higher),. This apparent 

• i- . * ' * . 

paradox can be understood by considering the normalized recall measure. 

I ] , , , 

Each document retrieved is assigned a "rank" in order of retrieval 



(rank 1 is the document retrieved first) The normalized! recall measure 

is based on the sum of the ranks pf all relevant document's in the search. 

■ * • • • ‘ 

A change in rank affects this measure equally regardless of the magnitude 

. • x, 

of the rank. That is, a change from rank 195 to rank 191 is equivalent to 
a change from 5 to 1 in its / j&ffect, on normalized recall. The same is not 
true for normalized precision. It seems evident that while the dec hi 

... a ( ' ’ / 

strategy increased the rank (1 is considered highest) * of some of the rele- 

; « 

vant documents, it decreases the ranks of others that are, on the average, 
of lower rank already. This explains the phenomenon of higher precision 

at all levels of recall but lower overall normalized recall. 

1 1 ♦ * . ' * ^ 

\ . 

Figure 13 shows -how much the dec hi strategy helps the 11 users 

V* ' * 

who receive no relevant documents in the fiirst 5 on the initial search. 

\ ' ' ' - : * 

For the "inc only" strategy, the .initial\ search and all subsequent iter- 
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; * ‘ 

ations are the same for these 11 users; the precision being about 10 percent. 

* # 

\ , 

Feeding back one non-relevant document fetches at least one relevant docu- 
ment onAthe first iteration for 7 of these 11 users. For some of these 

f \ /• « 

queries, some low ranking relevant documents are pushed still lower at first. 
The relevant documents which are raised to the first 5, however, provide a 
second iteration query which often raises, these same low documents again. 

The fixkt iteration curve thus shows the mpst improvement at low recall, 
while the second iteration shows great improvement all *along the recall- 
precision curve. 

Since the improvement ih total performance for the 11 "bad" queries 

a » i 

is so striking", .it is natural to wonder whether this strategy is helping 
or^ hurting the other, 31 users/ Figure 14 shows a different curve for the 

dec hi and>Q strategies run* only on the 31 queries that retrieve at least 

. * 

one relevant in the first 5 documents. A point above the zero line indi- 
cates that dec. hi is better than Q at that recall. Both iterations are 
better for dec hi, especially at the high recall end of the curve, where 
they differ by as much as six percent. Since the dec hi strategy improves 

‘ • . ■ j 

the results even for the "good" queries, a heuristic strategy that selects 
only Some of the queries (as does the "negative heuristic strategy" of 

i 

Riddle, Horwitz and Dietz) for the dec hi algorithm appears unnecessary 

t * / 

. , t 

in this environment. * 

Figure 15 represents a total performance difference curve comparing 

; 

the "dec hi" strategy with the alternative of decrementing the query by 
subtracting the two highest non-relevant documents retrieved on each iter- 
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N = 5 



1st iteration 



/ 



2nd iteration 
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For the 31 Good ^Queries , a Comparison Between 
Decrementing the Highest Non-Relevant 
and Incrementing the Relevant Only 
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‘ Total Performance Difference 
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ation (called “dec 2 hi"). It shows that decrementing only one non-relevant 
gives generally better, results ; the largest difference being a five percent 
hump at 40% recall in the* second iteration. In Figure 12 the journalized 
measures show the sjme ^relationship; the dec 2 hi strategy is one or two 

i. 

percent lower on each iteration than is dec hi. This result may be due to 

the danger mentioned earlier, that the non-relevant documents may be sub- 

* * 

tracting out most of the query. (Only one query completely disappears using 

i ' \ * 

this strategy, and jj.t is erased also by the dec hi and Rocchio’s algorithms). 



£ 



It might be possible to overcome this "disappearing query" phenomenon by 



U 



juggling the parameters tt , w, a, and y, without introducing .the complications 

t * * 

of Rocchio’s normalizing method. i 

t 

It was mentioned in Section VI-C that much of the improvement be- 
* * 
tween the N=10 and the N=5 curves of Figure 6 might be caused by the im-^ - 



provement on the 'six queries* that fetch a relevant document within the first 

10 but not the first 5 on the initial search. Seven users of the unlucky 

• : Y 

11 are helped by the dec hi strategy; that is, the dec hi strategy provides 



useful feedback for one more user than does the relevant document stra- 



tegy with N=l6. It is pertinent to ask if the dec hi algorithm has, in fact 

10 curve. RjLgure 16 shows a dif- 



-/ 



• / Jl 

attained the total performance" of the N= 



ference curve between the N=10 curve of Figure 6, and the dec hi curve of 

I * \ 

Figure 11. iThe"N=l'0 curve is higher f or *the first iteration, ovejt five 

y 

percent higher at the' high recall end of the qurve. This is understandable 






'X 



in view of the lowering of low-ranking relevant documents on the first 

Jr 

iteration, discussed earlier in, this section. For the second iteration, 
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1st iteration 
2nd iteration 




(Feedback 10) - (Dec Hi) 
h Comparison Between Feeding Back 10 Documents 
but Increasing Relevant Only, 
and Feeding Back 5 Documents 

but Decrementing the Highest Non-Relevant 

. ' 

Total Performance Difference 



Figure 16 
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the dec hi curve is slightly better at the low recall end and only Slightly 

worse at recalls between sixty and ninety percent. Considering that the 

dec hi curve only requires half as much user effort (5 documents scaped 

instead of 10), the total performance results str6ngly~favor this non- 

relevant document retrieval strategy. * ‘ 

The feedback effect results are not as encouraging. Using the two 

overall normalized measures andphe recall-precision curves, three stra. 

tegies are compared and their differences tested with the two significance 

test/ described in Section V-A, the T-Test and the WSR Test. The three 

str/tegies that are compared in feedback effect performance are: 

. ' ' ' ' - V 

' J l) Feedback Dec Hi: The Dec Hi strategy with the feedback effect 

retrieval method, using the first retrieved non-relevant 

V 

document. f 

2) Rocchio: Rocchio’s recommended strategy (without nor- 

malized- vectors) using all retrieved documents. 

3) q +•. The strategy described inaction VII-B, t^hat gives 
added weight to the original query and uses only relevant 



j 



documents. 



( 



I 



The differences in feedback effect recall-precision' curves among 
th/se strategies are not significant. The largest dif ferences^und 
were 1.3% in the ‘first iteration, significant at th<f 30% lopJ. and 2.0% in 
the second iteration, significant at the 11% level. The largest dif- 
ference between Q paiid any other strategy was 1 . 2 %/lilnif icant at the 
64% level. I Thes/signif icance figures were obtained using the less con- 
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iS'ervative T-test. Figure 17 shows the differences among the three stra- 
tegies in normalized recall and normalized precision. The feedback effect 

9 I 

results agree with the total performance results in Showing a drop o in nor- 
malized precision for the two -non-re levant document strategies on the . first 
iteration. The five percent difference between Q q + and feedback Dec! Hi is . 
significant at the 6% level, and the six percent difference between Q q + 
and Rocchio is significant at the 3% level, according to the T-test. How- 
ever, the Wilcoxon Signed-Rank Test (WSR) indicates that the two algorithms 
do not give significantly different results. The significance level convr 
paring Q + and feedback Dec Hi .is 46%, and that comparing Q + &nd Rocchio 



is 95%. 



These different significance levels must be considered in the light 



of the characteristics of the two significance tests. The T-test takes 



X 



account of magnitude, the WSR test considers only rank. Evidently, differ- 

JiO n-: 



ences favoring Q^+ and differences favoring the 



■relevant document stra- 



tegy are^mixed in rank, producing insignificant results on the WSR test. 
Yet, sode of ttie results favoring Q q + (not, all , because the ranks are 
mixed) must be v£ry large in magnitude, to give significant indications on 
the T-test. Thus, for some queries, the Rocchio and Feedback Dec Hi algo- 
rithms mus 4 be much less effective than Q. q + as measured by normalized 

A. 

recall, while remaining effective as measured by normalized precision. 

The 'normalized recall obtained by feedback effect evaluation shows 
the same behavior as the total performance normalized .recall on the first 



iteration. Both evaluation methods lead „to the conclusion that the use of 



a 



r 



i 
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i / . 

„ % 

non- re levant documents for feedback apparently raises the ranks of fairly 

* » 

high-ranking relevant documents, and at the same time lowers the ranks of 

some low-ranking relevant documents on the first iteration. 

\}s ' ■ 

The significance levels obtained by comparing the first and second 
iteration results, to the initial search result within the strategies are 

v 1 ‘j* 

very informative. Figures 18 and 19 show the performance of algorithms 

0 + and Rocchio respectively. The significance of the gap between the 
o 

initial search and each iteration is tested, using the more conservative 



WSR test* , 

Looking at the three recall-precision graphs, the average perfor- 
mance t>f the three algorithms seem quite similar. In fact, the differences 
in average performance are not significant. Yet, the significance levels 
displayed in Figure 18 differ greatly from those displayed in Figure 19. 

For the Q + strategy, the differences between the initial search 
and each feedback iteration are^signif icant. ' On the first iteration, the 

two overall measures and the precision differences from 20%, through 50% 

\ • ' V j . . 

"recall are significant at the 5% level or ldss, and only aj 70 and 80% 

recall are the precision differences^po4 significant at the. 10% level. 

V v • 

\ 

On the second iteration, the performance difference is significant at the 



5% level for all points except 70% recall. For the Rocchio strategy, 



'however , only one measure (precision at 50% recall) shows a significant 



difference between the first iteration and initial search at the 10% level 

i- <■ ■ ■ ' ‘ • 




« For these comparisons, a one-tailed significance level is appropriate, since 
performance is expected to improve. To obtain one - tailed , values , the 
reported two-tailed values must be divided by two* That is, the proba- 
bility that the first iteration is no better than the initial search is 5% 
or less except at 70 and 80% recall. - 
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Difference 


T 

Test 


1 

Q + strategy 


Iter ' 
1 


6.1 


3.4 


24.1 


2.9 


15.4 


°- minus 
Rocchio strat- 
egy 


Iter 

2 


4.4 


17.7 


48.9 


i 

2.0 


43.0 


Q + strategy 


Iter 

1 


5.4 


5.7 


98.5 


2.1 


31.4 

1 • 


minus- 

• Dec, Hi strat- 
egy 


Iter 

2 


7.0 


9. .4 


45.6 , 


3.0 


i , 

29.6 
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Statistical Comparison of Feedback Effect/ 
Relevant and Non-Relevant Document Strategies 
Normalized Recall and Precision 



Figure 17 
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Even on the v,eccnd iteration; only six of the twelve differences are 

I 

significant at the 10% level or less, two at the 5% level or less. 
Significance results for the feedback Bee Hi strategy are similar. 

When comparing first and second iterations, the Q q + results 

r 

are no longer more significant than the Rocchio results. In fact, the 
Rocchio results are significant (10% level Cat less*) for eight of the 

I 

twelve measures; the C Q+ results for only five. The significant im- 
provement between first and .second iterations occurs at the high recall 
end of the Q o + curve, while the improvement for the Rocchio strategy 

is more evenly distributed. 

This difference between strategies in the significance of the 
improvement over the initial' search leads to a general conclusions 
Performance on all measures is less consistent for the non-relevant 
document strategies than for the Q q + strategy. However, since the 
average magnitude of this improvement is equal for the three alg 
rithms (from the significance results presented in Figure 5) , it must 
be true that the Rocchio and Dec Hi strategies are better for some 
queries endorse for others than is the more consistent Q o + strategy. 

The total performance results of Figure 13/" indicate that the 
queries that retrieve no relevant documents on the\nitial search are 
helped by the non-relevant feedback strategies. "Figure 20 supports 
this conclusion with evidence that even using feedback effect eval- 
uation, the Rocchio strategy provides better performance on these 
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eleven queries. Figure 14 adds the information that on the average, 

the total performance on the remaining 31 queries is not hurt by 

$ 

#■ 

negative feedback. The preceding paragraph leads to the conclusion 

that^in feedback effect the Rocchio strategy gives worse performance 

* * 

on some of "these queries. This conflict between total performance 
and feedback effect results requires further investigation of sub- 



\ 



groups of queries 



The document curve differences presented in Figure 21 provide 
new information about the (performance of the negative feedback 
strategies. The Q^+ strategy is taken as the norm, and the Rocchio 
and Feedback Dec Hi differences from Q q + are graphed. For the first 
fifteen or so documents retrieved, both Rocchio and Feedback Dec Hi 



T 



are superior in feedback effect performance to Q^+. After 40 docu- 
ments have been retrieved, both are much worse than Q q + in recall, 
though about the same in precision. The recall-precision curves of 
Figures 18 and 13 average out these two extremes, and lose the sig- 
nificant information. 

Fig/ure 20 strongly supports the conclusion implied by nor- 
malized recall) that on the first iteration non-relevant document 
strategies tend to raise some relevant doucments, but to lower others 

that are already low in rank. The average advantage of the non- 

« ^ * 

relevant document strategies appears early in the retrieval process. 
After 20% of the collection has been scanned, the Q + strategy is 

A ^ 

clearly superior in recall. The rank-freezing of the -feedback effect 
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N = 5 



O initial search (Q q + strategy first iteration) 



& first iteration 

* 

£ -second iteration 




Rocchio Strategy 
On Eleven Bad Queries 
Feedback Effect Recall-Precision Curves 



’igure 20 
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Recall and Precision Differences 
Comparing Rocchio and Dec Hi Strategies 

to Q + norm 

Feedback Effec? Document Curves 



number of documents 
retrieved 



Figure 21 
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evaluation affects the ranks of the earliest documents retrieved, so the 

‘ ■'N 

recall-precision curves of the non-relevant document strategies appear 

& 

superior in total performance but only equal in feedback effect. Norma- 
lized recall expresses the later large drop in recall_i/hich overwhelms the 
advantage of negative feedback. Thus, the (Document curves support 
and clarify the tentative deductions made from the less detailed measures 

i 

presented earlier. _ 

Total performance comparisons encourage the use of algorithms that 
employ negative feedback of non-relevant documents. However, the feedback 
effect results indicate that the performance of negative feedback algorithms 
is highly variable. These findings encourage a search for a means of pre- 
dicting the appropriate strategy for a given query. For this reason the 

* ' '** ' . 

characteristics of selected subgroups of queries are explored in the 

following section, , 

* * 

E. Characteristics of Query Subgroups 

To investigate the performance of positive and negative feedback 
in more detail, the available queries are split several ways into pairs 

of query subgroups. Each * subgroup pair represents a contrast based on 

/ 

one or two characteristics. For example, all queries, with four or fewer 
relevant documents might form one subgroup of a pair, and all queries 
with five or more relevant documents might constitute the contrasting 
subgroup of that pair. The six queries that retrieve all relevant documents 
rank 5 or less on the initial search are omitted from analysis 
because relevance feedback cannot improve the feedback effect performance 
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on these queries. Figure 22 lists the characteristics used for selection 
and describes some subgroups for which comparisons are reported. 



Each subgroup is statistically compared to the contrasting sub- 
group-using the Wilcoxon Rank Sum test (see Section * V) . -All f indings of 



less than 10% significance are reported in this section. If the word 



the indicated comparison is greater than ten percent, and the 'null hypo- 
thesis' of no difference between subgroups is supported. ' Although signi- 



Second, the numbers of queries in each subgroup is low, and statistical 
significance is c^if'i^cv^it to pyove for small samples. For these reasons 



significance levels from five to ten percent may indicate areas for pro- 
ductive investigation using larger query collections and perhaps more 
sophisticated statistical techniques. Twenty-two variables are used for 
WRS comparisons within each subgroup pair. Two are not generated by the 
search process; the number of concepts in the initial query and the number 



are correlation of the modified query with the original query, feedback 



and all three measures are calculated for two iterations of two feedback 

* • - 

strategies, the positive feedback Q q + strategy and the Rocchio algorithm 



•null' is used in the WRS column-of^a figure, the significance level of 



ficance level from fiv ten percent are not normally considered mean- 
ingful, they are reported here for two reasons. First, the Ws test is 
conservative when too many ties in rank occur , and the data contains ties 




of relevant documents (2 vars. ) . - The three search-related measures used 



r effect normalized recall, and feedback effect normalized precision. Nor- 



malized recall and precision are^calculated for the initial search (2 vars.) 
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Selection { 

Characteristic ! 

i 


Group Name 
(Size) 


J 

' j 

Group Description 


) 

Initial i 

. Search 
Retrieval < 

1 

j l 

1 

j 


Eleven Bad l 

(ID. ! 

i 

Twenty-F ive 
(25) 

f 


No relevant documents are retrieved with 
rank 5 or less on the initial search. 

Some but not all relevant documents are 
retrieved with rank 5 or less on the 
initial search. 

i 


* Relevance 
Feedback , 

1 f 

Performance 


Good Per- 
i forma nee 

\ (16) 


At least one feedback strategy retrieves 
all relevant documents with rank 15 or 
less within three iterations. ] 


1 

i 

i 

i 


Bad Per- 
- forma nee 
! (20) 


No feedback strategy retrieves all relevant 
documents with rank 15 or less within three 

* i 

' iterations. * 


Correlation • 

of Modified 
Query „with 


Six subgroups are chosen. The number of 
queries in each is given in the following 
table : 



Original 
Query 

% 



Relevance 

Feedback 

Strategy 



Low 

High 



Strategy 


* 


_ 




Rocchio 


Iteration 


1 


CM 

1 

H 

CM 


1 


2 


1-2 




17 


16 13 


16 


19 


17 




19 


20 17 


20 


17 


19 



V 

(13) 

Rocchio 

(15) 



The Q + strategy retrieves more documents 

with rank 15 or less in three iterations,, 

■! 

The Rocchio strategy retrieves more docu- 
ments with rank 15 or less in three itera- 
tions. 



Number of 
Concepts in 
Original Query 
and* 
Number of 
Relevant . 
Documents 



High-Low 

or 

Low-High 

(17) 



Similar^ 

(19) 



Queries naving relatively many relevant 
documents and relatively few concepts or 
vice versa: \ 

From 2-3 relevant and 7+ concepts (2) 

From 4-5 relevant and 10+ concepts (5) 

From 5-6 relevant and 3-6 concepts (5) 

7+ relevant and 3-7 concepts (5) 

Queries having a number of concepts 
and a number of relevant documents 
similar in magnitude: 

From 2-4 relevant and 3-6 concepts (8) 

From 4-6 relevant and 7-9 concepts (6) 

6+ relevant and 8+ concepts (5) 



L ' 



Some Query Subgroups Investigated 
Figure 22 
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which uses negative feedback (12 vars.). For normalized recall and pre- 



cision, the improvement caused by feedback over the initial search is 



used for comparison to remove the effect of initial search differences 



between subgroups. To provide a direct comparison between positive and 



negative feedback, the differences between the Q q + and Rocchio strategies 



in normalized recall and precision for two iterations are used (4 vars.). 



Finally, the difference between the first and second iteration correla- 

s 

tion of the modified query with the original query is calculated for each 

strategy (2 vars.). Obviously significant relationships such as the 

« s , #• l * 

difference in number of relevant documents between queries with four or 



fewer and queries with five ob more relevant douements are not reported. 

» \ 

Normalized recall and normalized precision were chosen for the 



subgroup comparisons because they are overall measures of retrieval. How- 

t „ 

ever, the analysis in the. previous section indicates that the normalized 

» • 

figures are riot representative of overall performance as indicated by the 
recall-precision curves and the document curves. In .particular , normalized 
recall shows a large drop for the Rocchio strategy, and neither recall 
nor precision reflects the initial advantage of the Rocchio strategy dis- 
played in Figure 21. Therefore, the normalized measures may not be the 
best choice for meaningful comparison of positive and negativk feedback. 

Figures 24, 28, and 29 in this section display recall-precision 
curves. Unfortunately, significance tests between subgroups for recall- 

precision curves are not available. However, in three subgroup pairs, 

\ - * ‘ 

selected by strategy, performance, and number of relevant documents, 










Wilcoxon Signed Rank tes^s of the difference between the Q q + and Rocchio ‘ , - 

strategies within each subgroup were made. All differences were signifi- 
cant in both strategy subgroups y^po differences were significant in any 
other, subgroup. 

In the figures of this section, the average values of variables j 
as well as the WRS probabilities are presented. It should be noted that 

t 

the WRS probabilities do not indicate the significance of differences in . 
average value , but the significance in differences in rank sum when all 
queries in both subgroups are ranked. The average value is reported * 
because it is a more familiar figure and conveys more intuitive meaning 



than the rank sum. . 

The firsts subgroup pair listed in Figure 22 is familiar from 

the previous section. . Figure 13 and 20 present total performance and 
feedback effect recall-precision curves for the 'eleven bad' group, both 
showing an advantage for the Rocchio strategy. Figure 23 presents some of 
the significant WRS findings for this group. The average number of rele- 
vant documents for the eleven bad queries is 4.3, contrasting with 5.6 
for the remaining ' twenty- five queries. The WRS probability that these 
subgroups represent populations that have the same distribution of number 



of relevant documents is less than ten percent, so the difference is of 
doubtful significance. When the Q q + improvement is compared to the Rocchio 
improvement, the normalized recall and precision indicate an advantage for 
the Q + strategy in both subgroups. This finding contradicts the recall- 
precision curves presented earlier, and is misleading for the reasons 
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